Covid19 Data
1 Chapter 1: Introduction
Team A consists of the following members: Amal Alqahtani, Jiaxiang Peng, Naureen Elahi, and Xinya Mu. Our work is available on GitHub.
Coronavirus disease 2019 (COVID-19) has spread rapidly around the world, causing damage on a scale the world was not prepared for. To date, the CDC reports a total of 4,542,579 cases and 152,870 deaths in the United States (Cases in the U.S., 2020). Many risk factors have been hypothesized to affect case and death rates from the virus. We felt the following questions were relevant: Which regions have the highest number of deaths? What can we say about patient demographics? Is race a significant risk factor for increased COVID-19 incidence in the United States? Are there any general trends among underlying health conditions? These questions are all suited to Exploratory Data Analysis (EDA), and with them in mind we looked for readily available COVID-19 data to analyze. Eventually, our question morphed into the following: What are the factors (i.e., patient demographics, social determinants of health, environmental variables, underlying health conditions, country of origin) affecting COVID-19 case numbers and death rates across different geographical locations in the US?
We found a dataset called Covid-19-Dataset on GitHub: https://github.com/johndurbin93/Covid-19-Dataset. It includes the number of confirmed COVID-19 cases and deaths through April 14, 2020, obtained for each U.S. county from the Center for Systems Science and Engineering (CSSE) Coronavirus Resource Center at Johns Hopkins University. Race demographics for counties were obtained from the County Health Rankings and Roadmaps Program database. Daily temperature data for counties were obtained from the National Oceanic and Atmospheric Administration. The data were compiled by a group of researchers.
The report is organized as follows:
- Description of the Data (explanation of the dataset and its variables)
- Demographics Data of the Patients
- Independent Variables EDA: Slicing the Data for an Overview
- Independent Variables EDA: Boxplots, Scatterplots, ANOVA, & Chi-Square
- Linear Regression Model
- Conclusion
2 Chapter 2: Description of the Data
2.1 Source Data
The data looks like the following:
tibble [3,144 x 82] (S3: tbl_df/tbl/data.frame)
$ Province : chr [1:3144] "New York City" "Nassau" "Suffolk" "Westchester" ...
$ State : chr [1:3144] "New York" "New York" "New York" "New York" ...
$ Latitude : num [1:3144] 40.8 40.7 40.9 41.2 41.8 ...
$ Longitude : num [1:3144] -74 -73.6 -72.8 -73.8 -87.8 ...
$ Tests : num [1:3144] 499143 499143 499143 499143 110616 ...
$ Days Since 1st Case : num [1:3144] 44 41 38 43 82 35 41 80 39 38 ...
$ total_cases : num [1:3144] 110465 25250 22691 20191 16323 ...
$ deaths : num [1:3144] 7905 1001 608 596 577 ...
$ Population (for demographic %'s) : chr [1:3144] "8623000" "1358343" "1481093" "967612" ...
$ % less than 18 years of age : chr [1:3144] "20.9" "21.459675499999999" "21.134324500000002" "21.900513799999999" ...
$ % 65 and over : chr [1:3144] "14.1" "17.763039200000001" "16.862951899999999" "17.053116299999999" ...
$ % Black : chr [1:3144] "24.3" "11.6331442" "7.3924459799999998" "13.8042935" ...
$ % American Indian & Alaska Native : chr [1:3144] "0.4" "0.54294092000000005" "0.61373593999999998" "0.95647842000000005" ...
$ % Asian : chr [1:3144] "13.9" "10.4504532" "4.1896086199999996" "6.43553408" ...
$ % Native Hawaiian/Other Pacific Islander : chr [1:3144] "0.1" "0.1" "9.5899999999999999E-2" "0.13228443000000001" ...
$ % Hispanic : chr [1:3144] "29.1" "17.231362000000001" "19.775260599999999" "25.140345499999999" ...
$ % Non-Hispanic White : chr [1:3144] "32.1" "59.333835399999998" "67.190378999999993" "53.088118000000001" ...
$ % Not Proficient in English : chr [1:3144] "9" "5.3660427200000003" "4.00639637" "6.3180527499999997" ...
$ % Female : chr [1:3144] "52.3" "51.306334300000003" "50.771693599999999" "51.559095999999997" ...
$ % Rural : chr [1:3144] "0" "0.19223132000000001" "2.6011316799999999" "3.2734774500000001" ...
$ Population Density per Square mile of Land (2010) : num [1:3144] 69468 4705 1637 2205 5495 ...
$ Housing Density Per Square Mile of Land : num [1:3144] 37106 1645 625 861 2306 ...
$ Avg Daily March 2011 Sunlight (KJ/m2) Missing HI and AK : num [1:3144] 16233 16649 16539 15031 14299 ...
$ GDP 2018 : num [1:3144] 600244287 81196003 81211899 73404644 362063569 ...
$ GDP/capita : num [1:3144] 69.6 59.8 54.8 75.9 69.9 ...
$ Percentage Living in Poverty, All Ages, 2016 : num [1:3144] 17.2 6.1 7.6 10 15 22.9 6.9 16.3 14.4 15.6 ...
$ Air Quality, Annual Average Ambient Concentrations of PM2.5, 2014 : chr [1:3144] "10.8" "10" "9" "10.4" ...
$ Primary Care Physicians Ratio : chr [1:3144] "31.417361111111109" "29.834027777777777" "56.709027777777777" "30.334027777777777" ...
$ Dentist Ratio : chr [1:3144] "23.334027777777777" "34.500694444444441" "50.209027777777777" "37.834027777777777" ...
$ Mental Health Provider Ratio : chr [1:3144] "4.834027777777778" "13.792361111111111" "15.625694444444443" "10.750694444444443" ...
$ High School Graduation Rate : chr [1:3144] "74.536495200000005" "90.769602500000005" "89.560601599999998" "89.554779199999999" ...
$ % Some College : chr [1:3144] "84.074597800000006" "75.579882699999999" "67.067068199999994" "71.893479799999994" ...
$ % Unemployed : chr [1:3144] "3.6720665100000001" "3.5355112100000001" "3.8509406199999998" "3.8880261699999998" ...
$ % Children in Poverty : chr [1:3144] "19.7" "7.6" "9.4" "10.3" ...
$ Income Inequality Ratio (80th%/20th%) : chr [1:3144] "9.2065919600000008" "4.5137498100000002" "4.3752126000000002" "6.18534249" ...
$ % Single-Parent Households : chr [1:3144] "39.575203500000001" "19.238140600000001" "23.6102569" "25.424638399999999" ...
$ Social Association Rate : chr [1:3144] "12.8789886" "7.9882352399999998" "6.7450214400000004" "8.3754656999999995" ...
$ Violent Crime Rate : chr [1:3144] "586.40744800000004" "143.663387" "124.039181" "220.606166" ...
$ Air pollution: Average Daily PM2.5 : chr [1:3144] "10.8" "10" "9" "10.4" ...
$ Presence of Drinking Water Violation : chr [1:3144] "No" "No" "No" "Yes" ...
$ % Severe Housing Problems : chr [1:3144] "24.378637699999999" "21.324080599999998" "22.888761800000001" "24.236306200000001" ...
$ Housing: Severe Cost Burden : chr [1:3144] "19.610767299999999" "19.1674103" "20.427237699999999" "20.895964899999999" ...
$ Housing: Overcrowding : chr [1:3144] "5.4547143900000004" "2.5236808000000002" "2.6421104199999998" "4.2602996299999996" ...
$ Housing: Inadequate Facilities : chr [1:3144] "1.2204915199999999" "0.72802853000000001" "0.78609931" "0.73443351999999995" ...
$ % Drive Alone to Work : chr [1:3144] "6.0475223400000004" "68.609857500000004" "79.604339800000005" "57.587820100000002" ...
$ % Long Commute - Drives Alone : chr [1:3144] "66.7" "45.7" "41.9" "41.2" ...
$ Sleep <7 Hours_Percent : chr [1:3144] "NA" "38.049835600000002" "35.608102700000003" "33.101763800000001" ...
$ Sleep <7 Hours_CI_Low : chr [1:3144] "NA" "37.488497199999998" "34.960704200000002" "32.608553999999998" ...
$ Sleep <7 Hours_CI_High : chr [1:3144] "NA" "38.576512200000003" "36.198949800000001" "33.594731299999999" ...
$ Diabetes Total Percentage : num [1:3144] 6.5 7.2 6.8 6.4 9 10.3 6.8 8.1 6.9 8.2 ...
$ Diabetes Male Percentage : num [1:3144] 6.7 8.5 7.7 6.7 9.7 10.7 7.1 8.6 7 8.7 ...
$ Diabetes Female Percentage : num [1:3144] 6.3 6.2 6 6.2 8.4 10.1 6.6 7.7 6.8 7.9 ...
$ Coronary Heart Disease Death Rate per 100,000, All Ages, All Races/Ethnicities, Both Genders, 2014-2016: num [1:3144] 100.4 142.4 120.1 97.6 95.2 ...
$ Hypertension Death Rate per 100,000 (any mention), 35+, All Races/Ethnicities, Both Genders, 2014-2016 : num [1:3144] 232 153 181 124 191 ...
$ Obesity, Age-Adjusted Percentage, 20+. 2015 : num [1:3144] 15.9 22.5 23.6 20.2 27.2 34.1 22.4 21.2 23.4 23.9 ...
$ % Fair or Poor Health : chr [1:3144] "15.610279800000001" "12.0544118" "13.0711332" "14.8011888" ...
$ Average Number of Physically Unhealthy Days : chr [1:3144] "3.5938226700000002" "2.8691053700000002" "3.1473144999999998" "3.1513169799999998" ...
$ Average Number of Mentally Unhealthy Days : chr [1:3144] "3.97126146" "3.4601849699999998" "3.9316660200000002" "3.9107989299999999" ...
$ % Low Birthweight : chr [1:3144] "8.2870096600000007" "7.8873580399999996" "7.7408509700000003" "7.95359718" ...
$ % Smokers (adults) : chr [1:3144] "12.418234200000001" "11.225364600000001" "12.625481499999999" "11.371546" ...
$ % Adults with Obesity : chr [1:3144] "14.6" "23.6" "24.6" "20.7" ...
$ Food Environment Index : chr [1:3144] "8.3000000000000007" "9.6999999999999993" "9.3000000000000007" "9.1" ...
$ % Physically Inactive : chr [1:3144] "17.5" "22.8" "24.2" "21.2" ...
$ % With Access to Exercise Opportunities : chr [1:3144] "100" "98.858183299999993" "93.3366592" "99.621119899999997" ...
$ % Excessive Drinking : chr [1:3144] "24.812851999999999" "18.439903699999999" "18.671426799999999" "18.011370899999999" ...
$ % Uninsured : chr [1:3144] "6.15572813" "5.32768102" "5.4469207300000004" "6.9293390300000004" ...
$ Preventable Hospitalization Rate (Preventable hospital stays) : chr [1:3144] "3082" "3588" "4339" "3870" ...
$ % With Annual Mammogram : chr [1:3144] "39" "45" "42" "46" ...
$ % Flu Vaccinated : chr [1:3144] "46" "52" "51" "51" ...
$ Chronic Respiratory Disease: mortality rate per 100K (2014) : chr [1:3144] "23.47" "29.03" "38.590000000000003" "31.82" ...
$ Liver Disease: crude mortality rate per 100K (1999-2018) : chr [1:3144] "7.3202151400000002" "7.8321364999999998" "9.3999156199999998" "8.6888457900000002" ...
$ Liver Disease: % of Total Deaths (1999-2018) : chr [1:3144] "2.72836E-3" "2.4593800000000002E-3" "3.2464299999999998E-3" "1.92728E-3" ...
$ Liver Disease: crude mortality rate per 100K (2018) : chr [1:3144] "6.6924499900000001" "8.7606738499999999" "11.478009800000001" "8.9912072199999997" ...
$ Liver Disease: % of Total Deaths (2018) : chr [1:3144] "1.9492800000000001E-3" "2.1281199999999998E-3" "3.04017E-3" "1.5558499999999999E-3" ...
$ Avg Temp Peak Growth-10 Rate : num [1:3144] 8.41 7.41 6.86 5.88 2.25 ...
$ Avg Temp 10 Before First-Current : num [1:3144] 8.33 7.78 7.1 6.68 2.55 ...
$ Avg Temp First-Current : num [1:3144] 9.23 8.36 7.82 7.53 3.34 ...
$ First Case : POSIXct[1:3144], format: "2020-03-02" "2020-03-05" ...
$ Stay At Home : POSIXct[1:3144], format: "2020-03-22" "2020-03-22" ...
$ No Cases : num [1:3144] 0 0 0 0 0 0 0 0 0 0 ...
$ No Stay At Home Order : num [1:3144] 0 0 0 0 0 0 0 0 0 0 ...
$ Stay At Home Order After First Case : num [1:3144] 1 1 1 1 1 1 1 1 1 1 ...
# A tibble: 3,144 x 16
Province State total_cases deaths `Population (fo~ `% less than 18~
<chr> <chr> <dbl> <dbl> <chr> <chr>
1 New Yor~ New ~ 110465 7905 8623000 20.9
2 Nassau New ~ 25250 1001 1358343 21.459675499999~
3 Suffolk New ~ 22691 608 1481093 21.134324500000~
4 Westche~ New ~ 20191 596 967612 21.900513799999~
5 Cook Illi~ 16323 577 5180493 21.8062644
6 Wayne Mich~ 12209 820 1753893 23.617233200000~
7 Bergen New ~ 10426 550 936692 21.176117699999~
8 Los Ang~ Cali~ 10047 360 10105518 21.660374099999~
9 Rockland New ~ 8335 263 325695 28.158860300000~
10 Hudson New ~ 8242 277 676061 20.4450486
# ... with 3,134 more rows, and 10 more variables: `% 65 and over` <chr>, `%
# Black` <chr>, `% American Indian & Alaska Native` <chr>, `% Asian` <chr>,
# `% Native Hawaiian/Other Pacific Islander` <chr>, `% Hispanic` <chr>, `%
# Non-Hispanic White` <chr>, `% Not Proficient in English` <chr>, `%
# Female` <chr>, poor_health <chr>
The Covid19 dataset has 82 columns and 3,144 rows (entries), for a total of 257,808 individual data points. Of these 82 columns, we select the following variables for EDA:
- Province
- State
- State Code
- Tests
- Total cases
- Deaths
- Population (for demographic %’s)
- % less than 18 years of age
- % 65 and over
- % Black
- % American Indian & Alaska Native
- % Asian
- % Native Hawaiian/Other Pacific Islander
- % Hispanic
- % Non-Hispanic White
- % Not Proficient in English
- % Female
- No Cases
- No Stay At Home Order
- Stay At Home Order After First Case
- Percentage Living in Poverty
- Social Association Rate
To prepare our data for EDA we clean the dataset and remove all NAs.
tibble [3,144 x 16] (S3: tbl_df/tbl/data.frame)
$ Province : chr [1:3144] "New York City" "Nassau" "Suffolk" "Westchester" ...
$ State : chr [1:3144] "New York" "New York" "New York" "New York" ...
$ total_cases : num [1:3144] 110465 25250 22691 20191 16323 ...
$ deaths : num [1:3144] 7905 1001 608 596 577 ...
$ Population (for demographic %'s) : chr [1:3144] "8623000" "1358343" "1481093" "967612" ...
$ % less than 18 years of age : chr [1:3144] "20.9" "21.459675499999999" "21.134324500000002" "21.900513799999999" ...
$ % 65 and over : chr [1:3144] "14.1" "17.763039200000001" "16.862951899999999" "17.053116299999999" ...
$ % Black : chr [1:3144] "24.3" "11.6331442" "7.3924459799999998" "13.8042935" ...
$ % American Indian & Alaska Native : chr [1:3144] "0.4" "0.54294092000000005" "0.61373593999999998" "0.95647842000000005" ...
$ % Asian : chr [1:3144] "13.9" "10.4504532" "4.1896086199999996" "6.43553408" ...
$ % Native Hawaiian/Other Pacific Islander: chr [1:3144] "0.1" "0.1" "9.5899999999999999E-2" "0.13228443000000001" ...
$ % Hispanic : chr [1:3144] "29.1" "17.231362000000001" "19.775260599999999" "25.140345499999999" ...
$ % Non-Hispanic White : chr [1:3144] "32.1" "59.333835399999998" "67.190378999999993" "53.088118000000001" ...
$ % Not Proficient in English : chr [1:3144] "9" "5.3660427200000003" "4.00639637" "6.3180527499999997" ...
$ % Female : chr [1:3144] "52.3" "51.306334300000003" "50.771693599999999" "51.559095999999997" ...
$ poor_health : chr [1:3144] "15.610279800000001" "12.0544118" "13.0711332" "14.8011888" ...
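Many of the percentage columns in the listing above are stored as character strings, with missing values written as the literal text "NA". The following is a minimal sketch of one way such columns could be converted and cleaned in base R. It is not necessarily the cleaning code used for this report: the toy data frame copies a few values from the listing, and `id_cols` is an assumed list of columns to leave as text.

```r
# Toy data frame mimicking a few columns of the source data (values copied
# from the listing above; the real data has 3,144 rows).
covid <- data.frame(
  Province = c("New York City", "Nassau", "Suffolk"),
  `% Black` = c("24.3", "11.6331442", "7.3924459799999998"),
  `Sleep <7 Hours_Percent` = c("NA", "38.049835600000002", "35.608102700000003"),
  deaths = c(7905, 1001, 608),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Columns that should stay as text (an assumption for this sketch).
id_cols <- c("Province", "State")

# Convert every other character column to numeric; the literal string "NA"
# becomes a real missing value first, so as.numeric() raises no warning.
to_convert <- setdiff(names(covid)[sapply(covid, is.character)], id_cols)
covid[to_convert] <- lapply(covid[to_convert], function(x) {
  x[x == "NA"] <- NA
  as.numeric(x)
})

# Drop rows that still contain missing values.
covid_clean <- covid[complete.cases(covid), ]
str(covid_clean)
```

In this toy example the New York City row is dropped because its sleep column is missing, which illustrates how removing all NAs can discard otherwise informative rows.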
3 Chapter 3: Independent Variables EDA
3.1 United States COVID-19 Cases and Deaths by Provinces (Cities)
3.1.1 What are the top 15 Provinces based on the number of cases?
The following bar chart shows the top 15 cities by number of Covid-19 cases.
3.1.2 What are the top 15 Provinces based on the number of deaths?
The following bar chart shows the top 15 cities by number of deaths.
3.1.3 What is the average number of cases for each State?
State total_cases
1 Alabama 59.03
2 Alaska 9.83
3 Arizona 258.60
4 Arkansas 19.44
5 California 437.43
6 Colorado 122.41
7 Connecticut 1682.00
8 Delaware 638.33
9 District of Columbia 2058.00
10 Florida 323.07
11 Georgia 85.74
12 Hawaii 101.60
13 Idaho 33.32
14 Illinois 227.48
15 Indiana 94.12
16 Iowa 19.20
17 Kansas 13.84
18 Kentucky 17.32
19 Louisiana 335.34
20 Maine 45.88
21 Maryland 394.75
22 Massachusetts 1843.87
23 Michigan 316.96
24 Minnesota 19.10
25 Mississippi 37.68
26 Missouri 40.98
27 Montana 7.18
28 Nebraska 9.48
29 Nevada 184.35
30 New Hampshire 103.50
31 New Jersey 3196.29
32 New Mexico 40.82
33 New York 3274.52
34 North Carolina 51.20
35 North Dakota 6.45
36 Ohio 82.81
37 Oklahoma 28.51
[ reached 'max' / getOption("max.print") -- omitted 14 rows ]
3.1.4 What is the average number of deaths for each State?
State deaths
1 Alabama 1.701
2 Alaska 0.172
3 Arizona 7.133
4 Arkansas 0.427
5 California 13.328
6 Colorado 5.109
7 Connecticut 83.375
8 Delaware 14.333
9 District of Columbia 67.000
10 Florida 7.836
11 Georgia 3.270
12 Hawaii 1.800
13 Idaho 0.750
14 Illinois 8.510
15 Indiana 4.207
16 Iowa 0.444
17 Kansas 0.657
18 Kentucky 0.900
19 Louisiana 15.922
20 Maine 1.250
21 Maryland 12.667
22 Massachusetts 49.600
23 Michigan 21.133
24 Minnesota 0.920
25 Mississippi 1.366
26 Missouri 1.284
27 Montana 0.143
28 Nebraska 0.161
29 Nevada 7.059
30 New Hampshire 0.300
31 New Jersey 133.476
32 New Mexico 0.939
33 New York 174.871
34 North Carolina 1.130
35 North Dakota 0.151
36 Ohio 3.705
37 Oklahoma 1.403
[ reached 'max' / getOption("max.print") -- omitted 14 rows ]
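State-level averages like the two tables above can be produced by grouping the county rows by State and taking the mean of each count. The sketch below uses base R's `aggregate()` on a five-row toy data frame whose values are copied from the first rows of the dataset; the name `counties` is hypothetical.

```r
# Five county rows copied from the head of the dataset.
counties <- data.frame(
  State = c("New York", "New York", "New York", "Illinois", "Michigan"),
  total_cases = c(110465, 25250, 22691, 16323, 12209),
  deaths = c(7905, 1001, 608, 577, 820)
)

# Mean cases and deaths across the counties of each state.
state_means <- aggregate(cbind(total_cases, deaths) ~ State,
                         data = counties, FUN = mean)
print(state_means, digits = 3)
```

Because each row is a county, these are per-county means within a state, so a state with a single entry (such as the District of Columbia) reports that entry's own count.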
3.1.5 Which cities had the greatest percentage of the population in poor health?
# A tibble: 10 x 2
Province poor_health
<chr> <dbl>
1 Zavala 41.0
2 Starr 38.9
3 Brooks 38.2
4 Holmes 37.6
5 Willacy 37.5
6 Jefferson 37.5
7 East Carroll 37.5
8 Claiborne 37.3
9 Kusilvak 36.9
10 Humphreys 36.2
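A ranking like the one above amounts to sorting by a column in decreasing order and keeping the first rows. A small sketch in base R follows; the helper `top_n_rows()` is hypothetical, and the toy values are taken from the tables in this report.

```r
health <- data.frame(
  Province = c("Nassau", "Zavala", "Suffolk", "Starr", "Brooks"),
  poor_health = c(12.0544118, 41.0, 13.0711332, 38.9, 38.2)
)

# Return the n rows with the largest values in the given column.
top_n_rows <- function(df, column, n = 10) {
  ranked <- df[order(df[[column]], decreasing = TRUE), , drop = FALSE]
  head(ranked, n)
}

top_n_rows(health, "poor_health", n = 3)
```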
3.2 Patient Demographics
3.2.1 What are the patient demographics?
| | TC | Population | young | old | black | AIAN | Asian | NH | Hispanic | NHW | Female | Poverty | Social |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min | 0 | 88 | 0.0 | 4.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.6 | 2.7 | 26.8 | 3.4 | 0.0 |
| Q1 | 2 | 11034 | 20.1 | 16.3 | 0.7 | 0.4 | 0.5 | 0.0 | 2.4 | 64.7 | 49.4 | 11.4 | 8.2 |
| Median | 9 | 25758 | 22.1 | 19.0 | 2.2 | 0.6 | 0.7 | 0.1 | 4.4 | 83.5 | 50.3 | 14.8 | 11.1 |
| Mean | 191 | 105871 | 22.1 | 19.3 | 8.8 | 2.4 | 1.5 | 0.1 | 9.6 | 76.2 | 49.9 | 15.9 | 11.6 |
| Q3 | 39 | 67013 | 23.8 | 21.8 | 9.6 | 1.3 | 1.4 | 0.1 | 9.9 | 92.3 | 51.0 | 19.0 | 14.4 |
| Max | 110465 | 10105518 | 42.0 | 57.6 | 85.4 | 92.5 | 43.4 | 48.9 | 96.4 | 97.9 | 56.9 | 48.6 | 52.3 |
From the means in the output, the average proportion of people under the age of 18 is 22.1%, and the average proportion of people aged 65 and over is 19.3%. The largest racial group is Non-Hispanic White, with an average proportion of 76.2%. The average proportion of women is 49.9%, the average proportion of people living in poverty is 15.9%, and the average Social Association Rate is 11.6. We divide the data into four levels according to total cases.
3.2.2 Which race is the majority of the sample?
Using the average values, we draw a pie chart of race proportions, which shows the overall share of each race. In what follows, we study which races' proportions are related to the number of confirmed cases and the number of deaths.
3.3 Stay at home policy in each province
3.4 Underlying Health Conditions
3.4.1 Are there any common underlying health conditions?
3.5 Impact of Temperature
4 Chapter 4: Independent Variables EDA: Boxplots, Scatterplots, ANOVA, & Chi-Square
First, we divide total cases into four levels, using the quartiles (minimum, first quartile, median, third quartile, maximum) as cut points, for further analysis.
[1] 0 2 9 39 110465
Shapiro-Wilk normality test
data: df2$TC
W = 0.05, p-value <0.0000000000000002
Bartlett test of homogeneity of variances
data: TC by rank
Bartlett's K-squared = 25341, df = 3, p-value <0.0000000000000002
The Shapiro-Wilk test is used to test whether the data conforms to the normal distribution.
H0: The sample data is not significantly different from the normal distribution
H1: The sample data is significantly different from the normal distribution
Since the p-value is less than 0.05, we reject the null hypothesis: total cases do not follow a normal distribution.
Test for homogeneity of variance (Bartlett test)
H0: The variance of total cases is the same across levels
H1: The variance of total cases differs across levels
The p-value is less than 0.05, so we reject the null hypothesis: total cases do not satisfy homogeneity of variance.
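The quartile binning and the two tests described above can be sketched in base R as follows. The counts here are simulated right-skewed values standing in for total cases (an assumption, not the real data); the names `df2`, `TC`, and `rank` follow the test output shown earlier.

```r
set.seed(1)
# Simulated right-skewed counts standing in for total cases (not the real data).
TC <- round(rlnorm(400, meanlog = 2, sdlog = 2))

# Cut points: minimum, Q1, median, Q3, maximum.
breaks <- quantile(TC, probs = c(0, 0.25, 0.5, 0.75, 1))
print(breaks)

# Assign each observation to a level; unique() guards against tied cut points.
rank <- cut(TC, breaks = unique(breaks), include.lowest = TRUE)
df2 <- data.frame(TC = TC, rank = rank)

# Normality of total cases, then equality of variances across the levels.
normality <- shapiro.test(df2$TC)
homogeneity <- bartlett.test(TC ~ rank, data = df2)
print(normality$p.value)
print(homogeneity$p.value)
```

Note that Bartlett's test is itself sensitive to departures from normality, so once normality is rejected its result is best read as descriptive.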
4.5 SMART Question:
4.6 SMART Question:
4.7 SMART Question:
5 Chapter 5: Linear Regression Model
5.1 SMART Question: What factors influence the death rate the most?
6 Chapter 6: LASSO & Ridge Regression
We convert the two variables Presence of Drinking Water Violation and Stay At Home Order After First Case into categorical variables, and then fit LASSO and ridge regressions. We normalize the data and split it into a training set and a test set so that we can estimate test errors. The same split is used here for LASSO and later for ridge regression. For brevity, we selected 34 variables for the following analysis.
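The preparation steps in this paragraph (categorical conversion, normalization, and a train/test split) can be sketched in base R as below. The data are simulated, the 50% split ratio and the seed are assumptions, and the column `Water` is a stand-in for the drinking water violation variable.

```r
set.seed(123)
n <- 200
# Simulated stand-in data: a response, one numeric predictor, one Yes/No flag.
dat <- data.frame(
  TC = rlnorm(n),
  population = rlnorm(n, meanlog = 10),
  Water = sample(c("No", "Yes"), n, replace = TRUE)
)

# Treat the flag as a categorical variable.
dat$Water <- factor(dat$Water)

# Standardize every numeric column to mean 0 and standard deviation 1.
num_cols <- names(dat)[sapply(dat, is.numeric)]
dat[num_cols] <- lapply(dat[num_cols], function(x) (x - mean(x)) / sd(x))

# Design matrix without the intercept column; the factor becomes "WaterYes".
x <- model.matrix(TC ~ ., data = dat)[, -1]
y <- dat$TC

# Random 50/50 split (the ratio is an assumption for this sketch).
train_idx <- sample(seq_len(n), size = n / 2)
x_train <- x[train_idx, ]
y_train <- y[train_idx]
x_test <- x[-train_idx, ]
y_test <- y[-train_idx]
```

Fixing the seed and reusing `train_idx` is what allows the same split to serve both the LASSO and the ridge fits.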
6.1 LASSO Regression
We draw the plot for different \(\lambda\) values to see the overall trend.
lowest lamda from CV: 0.00246
We see that the lowest MSE occurs at \(\lambda \approx 0.002\).
Mean MSE for best Lasso lamda: 0.203
All the coefficients :
(Intercept) population young old
-0.00301 0.26499 0.03709 0.02258
black AIAN Asian NH
0.00000 -0.00339 -0.09691 -0.00689
Hispanic NHW Female Rural
-0.00461 0.00751 -0.01883 0.02263
Population.Density
0.11744
The non-zero coefficients :
(Intercept) population young old
-0.00301 0.26499 0.03709 0.02258
AIAN Asian NH Hispanic
-0.00339 -0.09691 -0.00689 -0.00461
NHW Female Rural Population.Density
0.00751 -0.01883 0.02263 0.11744
In the LASSO fit, 11 variables have non-zero coefficients; the coefficients of the remaining variables shrink to zero. The results suggest that race, gender, age, population, population density, and rural proportion are all associated with total cases.
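To illustrate why LASSO sets some coefficients exactly to zero, here is a minimal coordinate-descent sketch in base R. It is not the fitting routine used for the results above: the data are simulated, the functions `soft_threshold()` and `lasso_cd()` are hypothetical helpers, and the penalty is scaled as \((1/(2n))\lVert y - Xb\rVert^2 + \lambda\lVert b\rVert_1\).

```r
# Soft-thresholding operator used by the LASSO coordinate update.
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Coordinate descent for (1 / (2n)) * ||y - X b||^2 + lambda * ||b||_1.
# Assumes the columns of X and the response y are already centered.
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  n <- nrow(X)
  b <- rep(0, ncol(X))
  for (iter in seq_len(n_iter)) {
    for (j in seq_along(b)) {
      # Residual with predictor j left out of the current fit.
      partial_resid <- y - X[, -j, drop = FALSE] %*% b[-j]
      z <- sum(X[, j] * partial_resid) / n
      b[j] <- soft_threshold(z, lambda) / (sum(X[, j]^2) / n)
    }
  }
  setNames(b, colnames(X))
}

set.seed(10)
n <- 200
# Simulated standardized predictors; only the first two affect the response.
X <- scale(matrix(rnorm(n * 4), n, 4,
                  dimnames = list(NULL, c("population", "young", "old", "black"))))
y <- as.numeric(scale(X %*% c(0.6, 0.2, 0, 0) + rnorm(n, sd = 0.5)))

fit_small <- lasso_cd(X, y, lambda = 0.01)

# Above max |x_j' y| / n every coefficient is thresholded to exactly zero.
lambda_max <- max(abs(crossprod(X, y))) / n
fit_large <- lasso_cd(X, y, lambda = 1.01 * lambda_max)

print(round(fit_small, 4))
print(fit_large)
```

The thresholding step is what distinguishes LASSO from ridge regression, which shrinks coefficients toward zero without setting them exactly to zero.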
We then calculate the R-squared of the LASSO regression, which is 0.164.
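For reference, R-squared can be computed from observed and predicted values as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up numbers, and `r_squared()` is a hypothetical helper rather than the code behind the figure above.

```r
# R-squared = 1 - SS_residual / SS_total.
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)
  ss_tot <- sum((actual - mean(actual))^2)
  1 - ss_res / ss_tot
}

# Made-up observed and predicted values for illustration.
actual <- c(1, 2, 3, 4, 5)
predicted <- c(1.1, 1.9, 3.2, 3.8, 5.0)
r_squared(actual, predicted)
```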
6.2 Ridge Regression
[1] 33 100
(Intercept) population young old
-0.000006691 0.000051520 -0.000000684 -0.000005816
black AIAN Asian NH
0.000003757 -0.000001430 0.000020005 0.000001083
Hispanic NHW Female Rural
0.000006572 -0.000009388 0.000004852 -0.000012388
Population.Density Housing.Density Sunlight GDP
0.000077406 0.000078049 0.000001705 0.000050606
Poverty Unemployed Children.Poverty Income.Inequality
-0.000002654 -0.000001358 -0.000003231 0.000013610
Social PM2.5 WaterYes SHP
-0.000003449 0.000003749 0.000001931 0.000014048
poorhealth Unhealthy.Days smokers Obesity
-0.000003469 -0.000004822 -0.000007770 -0.000011988
Physically.ina WAEO CRD Temp
-0.000007008 0.000010306 -0.000010943 -0.000001884
Order1
0.000010345
(Intercept) population young old
-0.0001074 0.0008357 -0.0000112 -0.0000937
black AIAN Asian NH
0.0000609 -0.0000231 0.0003231 0.0000169
Hispanic NHW Female Rural
0.0001059 -0.0001515 0.0000785 -0.0001996
Population.Density Housing.Density Sunlight GDP
0.0012570 0.0012672 0.0000273 0.0008199
Poverty Unemployed Children.Poverty Income.Inequality
-0.0000426 -0.0000217 -0.0000518 0.0002207
Social PM2.5 WaterYes SHP
-0.0000554 0.0000606 0.0000306 0.0002268
poorhealth Unhealthy.Days smokers Obesity
-0.0000558 -0.0000776 -0.0001251 -0.0001935
Physically.ina WAEO CRD Temp
-0.0001124 0.0001658 -0.0001765 -0.0000306
Order1
0.0001663
Because ridge regression penalizes the “L2 norm” of the coefficients, the coefficients are expected to be smaller when \(\lambda\) is large. At our “mid-point” (the 50th) value, \(\lambda\) = 11498, the sum of squares of the coefficients rounds to 0. At the 60th value (the sequence is decreasing), \(\lambda\) = 705, the sum of squares of the coefficients is 0.002, about 16 times larger.
We can use the predict function for various purposes, for example to obtain the coefficients at \(\lambda\) = 50.
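The shrinkage effect can be checked with the closed-form ridge solution \(\hat\beta = (X^\top X + \lambda I)^{-1} X^\top y\). The sketch below uses simulated standardized data and a hypothetical helper `ridge_coef()`; its \(\lambda\) scale is not directly comparable to the one used by the fitted model above, so only the trend matters.

```r
set.seed(42)
n <- 100
p <- 5
# Simulated standardized predictors and response (not the report's data).
X <- scale(matrix(rnorm(n * p), n, p))
y <- as.numeric(scale(X %*% c(2, -1, 0.5, 0, 0) + rnorm(n)))

# Closed-form ridge estimate for centered data (no intercept needed).
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

# L2 norm of the coefficient vector at three penalties, largest first.
lambdas <- c(11498, 705, 50)
norms <- sapply(lambdas, function(l) sqrt(sum(ridge_coef(X, y, l)^2)))
print(data.frame(lambda = lambdas, l2_norm = norms))
```

The norm grows as \(\lambda\) falls, which matches the pattern in the coefficient listings above.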
(Intercept) population young old
-0.0012334 0.0110737 -0.0001707 -0.0010885
black AIAN Asian NH
0.0007856 -0.0002850 0.0039843 0.0000888
Hispanic NHW Female Rural
0.0012624 -0.0018608 0.0009939 -0.0023810
Population.Density Housing.Density Sunlight GDP
0.0170830 0.0172559 0.0002979 0.0108541
Poverty Unemployed Children.Poverty Income.Inequality
-0.0004844 -0.0002253 -0.0005805 0.0029666
Social PM2.5 WaterYes SHP
-0.0006430 0.0007782 0.0002744 0.0028497
poorhealth Unhealthy.Days smokers Obesity
-0.0006499 -0.0009101 -0.0014932 -0.0024228
Physically.ina WAEO CRD Temp
-0.0012813 0.0019933 -0.0022012 -0.0004192
Order1
0.0019621
Then we use the separated training set and test set to see the test error.
The test set mean squared error (MSE) at \(\lambda = 4\) is 0.174. (We are using standardized scores.)
On the other hand, for the null model (\(\lambda\) approaches infinity), the MSE is 0.244. So \(\lambda = 4\) lowers the test MSE by about 29% relative to the null model.
We could also have used a very large \(\lambda\) value to find the MSE for the null model. The two methods yield essentially the same answer, 0.244.
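The equivalence between the null model and a very large \(\lambda\) can be illustrated with a small base R sketch. The data are simulated, the helpers `mse()` and `ridge_predict()` are hypothetical, and the intercept is left unpenalized by centering on the training data.

```r
mse <- function(actual, predicted) mean((actual - predicted)^2)

set.seed(7)
n <- 100
# Simulated data with a real signal in the first three predictors.
X <- matrix(rnorm(n * 3), n, 3)
y <- as.numeric(X %*% c(1, -0.5, 0.25) + rnorm(n))
train_idx <- sample(seq_len(n), size = n / 2)
x_train <- X[train_idx, ]
y_train <- y[train_idx]
x_test <- X[-train_idx, ]
y_test <- y[-train_idx]

# Null model: predict the training mean for every test observation.
null_mse <- mse(y_test, mean(y_train))

# Ridge with an unpenalized intercept, fitted on centered training data.
ridge_predict <- function(x_train, y_train, x_new, lambda) {
  x_center <- colMeans(x_train)
  xc <- sweep(x_train, 2, x_center)
  beta <- solve(crossprod(xc) + lambda * diag(ncol(xc)),
                crossprod(xc, y_train - mean(y_train)))
  as.numeric(mean(y_train) + sweep(x_new, 2, x_center) %*% beta)
}

big_lambda_mse <- mse(y_test, ridge_predict(x_train, y_train, x_test, 1e10))
moderate_mse <- mse(y_test, ridge_predict(x_train, y_train, x_test, 4))
print(c(null = null_mse, big_lambda = big_lambda_mse, lambda_4 = moderate_mse))
```

With a huge penalty the slopes collapse to zero and only the intercept remains, so the two null-model figures agree.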
(Intercept) population young old
-0.00192 0.37063 0.03949 0.01311
black AIAN Asian NH
0.10907 0.04541 -0.15188 0.00242
Hispanic NHW Female Rural
0.07496 0.17384 -0.02449 0.03603
Population.Density Housing.Density Sunlight GDP
0.08226 0.73164 -0.00993 -0.07849
Poverty Unemployed Children.Poverty Income.Inequality
-0.00886 0.00629 -0.05258 0.03361
Social PM2.5 WaterYes SHP
0.00193 -0.00942 0.02051 0.03758
poorhealth Unhealthy.Days smokers Obesity
0.14088 -0.05720 -0.06817 -0.00413
Physically.ina WAEO CRD Temp
0.01950 -0.01367 0.01171 -0.03587
Order1
-0.02147
Now for the other extreme, small \(\lambda\): at \(\lambda = 0\), ridge regression reduces to the ordinary least squares (OLS) model. We can first use the ridge regression result to predict the \(\lambda = 0\) case, which gives an MSE of 0.201.
We can also build the OLS model directly and calculate its MSE.
Call:
lm(formula = TC ~ ., data = train)
Residuals:
Min 1Q Median 3Q Max
-5.814 -0.051 0.005 0.062 6.966
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.00191 0.01929 -0.10 0.9211
population 0.37068 0.03979 9.32 < 0.0000000000000002 ***
young 0.03947 0.01958 2.02 0.0440 *
old 0.01310 0.02219 0.59 0.5552
black 0.11054 0.18067 0.61 0.5408
AIAN 0.04602 0.07642 0.60 0.5472
Asian -0.15159 0.03665 -4.14 0.000038 ***
NH 0.00245 0.01160 0.21 0.8328
Hispanic 0.07621 0.15610 0.49 0.6255
NHW 0.17577 0.23888 0.74 0.4620
Female -0.02448 0.01371 -1.79 0.0743 .
Rural 0.03601 0.01809 1.99 0.0468 *
Population.Density 0.08178 0.07716 1.06 0.2894
Housing.Density 0.73210 0.07416 9.87 < 0.0000000000000002 ***
Sunlight -0.00996 0.01842 -0.54 0.5889
[ reached getOption("max.print") -- omitted 18 rows ]
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.372 on 1252 degrees of freedom
Multiple R-squared: 0.923, Adjusted R-squared: 0.921
F-statistic: 469 on 32 and 1252 DF, p-value: <0.0000000000000002
The MSE for OLS regression is 0.135
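When reporting an OLS MSE it helps to state whether it was computed on the training set or the test set, since the two generally differ. The sketch below shows both calculations with `lm()` on simulated data; the variable names `x1` and `x2`, the split ratio, and the seed are assumptions.

```r
set.seed(2020)
n <- 120
# Simulated predictors and response (not the report's data).
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$TC <- 0.4 * dat$x1 + 0.7 * dat$x2 + rnorm(n, sd = 0.3)

train_idx <- sample(seq_len(n), size = n / 2)
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Fit OLS on the training rows only.
ols <- lm(TC ~ ., data = train)

# Training MSE uses the model's own residuals; test MSE uses held-out rows.
train_mse <- mean(residuals(ols)^2)
test_mse <- mean((test$TC - predict(ols, newdata = test))^2)
print(c(train = train_mse, test = test_mse))
```

Only the test MSE is directly comparable with the ridge and LASSO test errors quoted earlier.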
7 Chapter 7: Conclusion
8 Chapter 8: Bibliography
Cases in the U.S. (2020, August 01). Retrieved August 01, 2020, from https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html